---
output: 
  flexdashboard::flex_dashboard:
    theme: bootstrap
    source: embed
---

```{r setup, include=FALSE}
library(flexdashboard)
```
Introduction {data-icon="fa-book-open"}
=====================================

Tidying Data {data-icon="fa-table" .storyboard}
===================================== 

### **Loading Files From Kaggle Source** - Loading all libraries and files required for this analysis.

```{r echo=TRUE, message=FALSE, warning=FALSE}
#------LOAD DATA------------
library(tidyverse)
library(readxl)
library(leaflet)
library(flexdashboard)
library(gridExtra)
library(broom)
library(plotly)

#------LOAD FILES---------
CCodesfile="~/project/Country-Code.xlsx"
Zomatofile="~/project/zomato.xlsx"
CurrencyFile="~/project/CurrencyRates.xlsx"

CCodes=read_excel(CCodesfile) #importing country code file
zomato=read_excel(Zomatofile) #importing main zomato file
Currency=read_excel(CurrencyFile) #importing currency file

```

***
The original data set contained varying currencies per country, making it difficult to compare restaurants uniformly. We created a separate file containing United States Dollar (USD) conversion rates for each country, which we use to convert local-currency figures into a single standard currency.
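As a minimal sketch of the conversion step (with made-up country names and rates, not the actual contents of the CurrencyRates file), joining a per-country rate table and multiplying gives the USD figures:

```r
# Illustrative only: hypothetical rates, not the real CurrencyRates file
costs  = data.frame(Country = c("India", "UK"),
                    Average_Cost_For_Two = c(800, 50))
rates  = data.frame(Country = c("India", "UK"),
                    Rate_USD = c(0.014, 1.25))  # assumed conversion rates
merged = merge(costs, rates, by = "Country")
merged$Avg_Cost_USD = merged$Average_Cost_For_Two * merged$Rate_USD
merged$Avg_Cost_USD  # 11.2 (India), 62.5 (UK)
```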

### **Data Wrangling** - Although the data comes pre-cleaned into appropriate columns, some manipulation still needs to be done. 

```{r echo=TRUE, message=FALSE, warning=FALSE}
#------ GET RID OF SOME COLUMNS & ADD NEW ONES -------------
zomato = zomato %>% select(-c(Rating_Color,Switch_To_Order_Menu,Address,Locality_Verbose,Currency))
#we remove currency column as it is incorrect and contains too many unique symbols

#add in new columns: Country Names according to Country Code & proper currency names with conversion rates
zomato = zomato %>% left_join(CCodes) 
zomato = zomato %>% left_join(Currency) 

#add in new column: USD Cost (for equal comparison)
zomato = zomato %>% mutate(Avg_Cost_USD = Average_Cost_For_Two*`Conversion Rate (USD)`) 
#multiply the average cost by its USD conversion rate

#replacing Cuisines col. with Principal_Cuisines (the first category of cuisines for every country row)
zomato = zomato %>% separate(Cuisines,into=c("Principal_Cuisines"),sep=',') #keep only the first listed cuisine

#Make sure rating categories are ordered:
zomato = zomato %>% mutate(Rating_Text=ordered(Rating_Text,levels=c("Excellent","Very Good","Good","Average","Poor")))
```

***

We begin by removing unnecessary columns and adding in necessary ones:

1. Add in country names by country code via the country code file provided by Kaggle.

2. Add in our own currency conversion file, allowing us to convert average cost for two into a USD amount stored in its own column, "Avg_Cost_USD".

We notice there were many cuisines listed per restaurant, so to simplify things for analytic purposes, we extract the first cuisine listed by each restaurant.

### **Removing Missing Values** - Since we will be focusing on restaurant ratings, we remove restaurants with missing aggregate rating values.

```{r echo=TRUE, warning=FALSE}
#---------REMOVE MISSING VALUES----------
#we notice that there are several "0" rated stores for even expensive places so we decided to remove no rating stores
zomato[zomato == 0] = NA #mark zero values as NA
zomato[zomato == "Not rated"] = NA #mark unrated entries as NA
zomato = zomato %>% filter(!is.na(Aggregate_Rating))

india.data = zomato %>% filter(Country == "India")

```


Exploratory Analysis {data-icon="fa-chart-pie" .storyboard}
=====================================

First Stage Analysis {data-icon="fa-chart-bar" .storyboard}
=====================================
### **Linear Regression** - Model with all the possible predictor variables in our dataset.

```{r echo=FALSE}
india.data = india.data %>% mutate(Principal_Cuisines=fct_relevel(Principal_Cuisines,"North Indian"))
india.data = india.data %>% mutate(City=fct_relevel(City,"New Delhi"))

library(sjPlot)
fit1 = lm(Aggregate_Rating~City+Locality+Principal_Cuisines+Has_Table_Booking+Has_Online_Delivery+Price_Range+Avg_Cost_USD,data=india.data,weights=Votes)
tab_model(fit1, terms = c("(Intercept)"))
fit1.sum = as.data.frame(coef(summary(fit1)))
DT::datatable(fit1.sum, options = list(
  bPaginate = FALSE
))

```

***

Since we are aware most of our data comes from New Delhi and the most common cuisine is North Indian, we set the baselines for City and Principal Cuisines accordingly. The first linear regression model uses Aggregate Rating as the response, weighted by Votes, with the following predictors:

- City

- Locality

- Principal Cuisine

- Table Booking Option

- Online Delivery Option

- Price Range

- Average Cost for Two (USD)

This model yields an R-squared value of 59.73% and is statistically significant. The table on the left contains all the coefficient estimates for this model. Ordering the p-values from lowest to highest, we notice that the majority of the coefficient estimates have p-values above 0.05.

### **Diagnostic Graphs (1)** - Checking the residuals-versus-fitted plot for this model
```{r echo=F}
ggplotly(
  ggplot(data=fit1,aes(y=.resid,x=.fitted))+geom_point()+geom_smooth(se=F)
)
```

***

Our residuals seem fairly random, with the bulk of them centered around zero. The fitted line illustrates the overall trend of the residuals, which in our case falls slightly below zero. 

### **Diagnostic Graphs (2)** - Checking normality of residuals via QQ Plot
```{r echo = F}
p2 <-ggplot(fit1,aes(sample=.resid))+stat_qq()+
  stat_qq_line()+xlab("Theoretical Quantiles")+ylab("Standardized Residuals")
p2<-p2+ggtitle("Normal Q-Q")+theme_bw()

ggplotly(p2)
```

***
The QQ plot shows that the central region of the residuals is approximately normal, but the lower tail deviates heavily.

### **Diagnostic Graphs (3)** - Checking for homoscedasticity via Scale Location
```{r echo=F}
p3<-ggplot(fit1, aes(.fitted, sqrt(abs(.stdresid))))+geom_point(na.rm=TRUE)
p3<-p3+stat_smooth(method="loess", na.rm = TRUE)+xlab("Fitted Value")
p3<-p3+ylab("SquareRoot|Standardized residuals|")
p3<-p3+ggtitle("Scale-Location")+theme_bw()
ggplotly(p3)
```

***

We ideally want to observe a fairly straight horizontal line in this plot. Unfortunately, this scale-location plot suggests heavy heteroscedasticity.

### **Drop Test** - Checking if any variables can be dropped from the current model
```{r echo=F}
drop1(fit1,test="F")
```

***

Conducting a drop test using the F statistic shows we can drop the variable *Price Range* from our model without meaningfully affecting the fit.
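As context for the drop test, `drop1`'s F statistic for a single term is the same nested-model comparison that `anova` performs; a minimal sketch on a built-in dataset (not the zomato data):

```r
# drop1's F test for one term equals the nested-model anova F test
full    = lm(mpg ~ wt + hp + qsec, data = mtcars)
reduced = lm(mpg ~ wt + hp, data = mtcars)
anova(reduced, full)     # F test for dropping qsec
drop1(full, test = "F")  # same F value on the qsec row
```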

Second Stage Analysis {data-icon="fa-chart-line" .storyboard}
=====================================

### **Transforming Response Variable** - We apply our first transformation to the dependent variable and observe its effects.

```{r echo=TRUE}
india.data = india.data %>% mutate(Transformed_Rating = log10(Aggregate_Rating/(5-Aggregate_Rating)))
```

```{r echo=F}
fit.t = lm(Transformed_Rating~City+Locality+Principal_Cuisines+Has_Table_Booking+Has_Online_Delivery+Avg_Cost_USD,data=india.data,weights=Votes)
tab_model(fit.t, terms = c("(Intercept)"))
fitt.sum = as.data.frame(coef(summary(fit.t)))
DT::datatable(fitt.sum, options = list(
  bPaginate = FALSE
))

```

***

We add a new transformed variable to the dataset called "Transformed Rating". Since Aggregate Rating is bounded between 0 and 5, we choose a transformation whose back-transformed estimates cannot fall outside this range. The transformation is as follows: $$log_{10}{\frac{Aggregate Rating}{5-Aggregate Rating}}$$

The transformed linear regression model returns an improved R-squared value of 66.2% (about 7 percentage points higher than the original model).
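A quick sanity check (the helper names below are ours, not from the analysis) that the transformation is invertible, so fitted values can always be mapped back to the 0-5 rating scale:

```r
# forward transform and its inverse (names are illustrative)
to_transformed   = function(r) log10(r / (5 - r))
from_transformed = function(y) 5 * 10^y / (1 + 10^y)

r = c(0.5, 2.5, 4.9)
all.equal(from_transformed(to_transformed(r)), r)  # TRUE
```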

### **Drop Test** - Checking if any variables can be dropped from the transformed model
```{r echo=F}
drop1(fit.t,test="F")
```

***

Conducting a drop test using the F statistic shows we can now drop the variable *Has Online Delivery* from the transformed model without meaningfully affecting the fit.

### **Diagnostic Graphs (1)** - Plotting residuals against fitted values to check the transformed model

```{r echo=F}
fit.2t = lm(Transformed_Rating~City+Locality+Principal_Cuisines+Has_Table_Booking+Avg_Cost_USD,data=india.data,weights=Votes)

ggplotly(
  ggplot(data=fit.2t,aes(y=.resid,x=.fitted))+geom_point()+geom_smooth(se=F)
)
```

### **Diagnostic Graphs (2)** - Checking normality of residuals via QQ Plot
```{r echo = F}
p2t <-ggplot(fit.2t,aes(sample=.resid))+stat_qq()+
  stat_qq_line()+xlab("Theoretical Quantiles")+ylab("Standardized Residuals")
p2t<-p2t+ggtitle("Normal Q-Q")+theme_bw()

ggplotly(p2t)
```

***

Our QQ plot illustrates that most of the residuals do follow a relatively normal distribution. However, as with our original model, the upper and lower tails deviate from normality. 

It's important to note that most of our variables are heavily skewed, producing these large, influential outliers. Since we are working with a large sample size, the departure from normality shown by this graph doesn't raise many red flags.


### **Diagnostic Graphs (3)** - Checking for homoscedasticity
```{r echo=F}
p3t<-ggplot(fit.2t, aes(.fitted, sqrt(abs(.stdresid))))+geom_point(na.rm=TRUE)
p3t<-p3t+stat_smooth(method="loess", na.rm = TRUE)+xlab("Fitted Value")
p3t<-p3t+ylab("SquareRoot|Standardized residuals|")
p3t<-p3t+ggtitle("Scale-Location")+theme_bw()
ggplotly(p3t)
```

***

The scale-location plot shows that there is still clear heteroscedasticity. Ideally, we want the points to follow a relatively straight horizontal line - indicating homoscedasticity. Further analysis of our variables can explain how alarming this is (or isn't).

### **Checking Individual Variables (Average Cost for Two)** - The linear regression of Transformed Rating on Average Cost for Two returns a residual plot that shows a clear pattern

```{r echo = F}
costreg = lm(Transformed_Rating~Avg_Cost_USD,data=india.data,weights = Votes)
#check residual fitted plot
ggplotly(ggplot(data=costreg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F))
```

### **Checking Individual Variables (Average Cost for Two)** - Referring back to the original scatterplot of cost against rating, recall that an exponential relationship was present. Taking the log of cost makes the residual plot considerably more random.

```{r echo = F}
costreg2 = lm(Transformed_Rating~log(Avg_Cost_USD),data=india.data,weights = Votes)
ggplotly(ggplot(data=costreg2,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F))

```

### **City Regression** - Investigating if the predictor variable City can be transformed to improve the model

```{r echo = F}
cityreg = lm(Transformed_Rating~City,data=india.data,weights = Votes)
#check residual fitted plot
ggplotly(ggplot(data=cityreg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F))

```

***

At first glance, there seems to be some form of pattern in the residuals for the predictor variable City. However, this residual plot is not concerning, as the dips in the fitted line are mainly caused by a lack of data in certain parts of the graph. Otherwise, the points are rather random.


### **Locality Regression** - The residual plot for the regression Transformed Rating ~ Locality is random and centered around zero for the most part

```{r echo = F}
locreg = lm(Transformed_Rating~Locality,data=india.data,weights = Votes)
#check residual fitted plot
ggplotly(ggplot(data=locreg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F))
```

### **Principal Cuisines Regression** - Similar to the regression for City, we experience dips in certain areas of the residual graph due to missing data

```{r echo = F}
cuisinereg = lm(Transformed_Rating~Principal_Cuisines,data=india.data,weights = Votes)
#check residual fitted plot
ggplotly(ggplot(data=cuisinereg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F))

```


### **Booking Regression** - The traditional residual graph for the variable Has Table Booking is hard to interpret, as it is a binary variable

```{r echo = F}
bookreg = lm(Transformed_Rating~Has_Table_Booking,data=india.data,weights = Votes)
#check residual fitted plot
ggplotly(ggplot(data=bookreg,aes(y=.resid, x=.fitted))+geom_point()+geom_smooth(se=F))
```


### **Booking Regression (2)** - Using boxplots to compare residuals across the variable's values, we see the median is well below zero - the residuals are not randomly scattered

```{r echo = F}
#boxplot - booking vs residual
library(broom)
bookreg.aug = augment(bookreg) 
ggplotly(ggplot(bookreg.aug,aes(x=Has_Table_Booking,y=.resid))+geom_boxplot()) 

ggplotly(ggplot(bookreg.aug,aes(sample=.resid))+stat_qq()+
  stat_qq_line()+facet_wrap(~Has_Table_Booking))

```

### **Statistically Significant Predictors** - List of *localities* that are significantly better or worse than baseline

```{r echo=FALSE}
tidy(fit.2t) %>% filter(p.value<0.05) %>% filter(str_detect(term,"Locality")) %>%  arrange(estimate)
```

### **Statistically Significant Predictors** - List of *Cities* that are significantly better or worse than baseline
```{r echo=F}
reg.city = tidy(fit.2t) %>% filter(p.value<0.05) %>% filter(str_detect(term,"City")) %>%  arrange(term)
head (reg.city, 2)
```


Conclusion {data-icon="fa-book"}
=====================================
The dataset we chose contains information on 9551 different restaurants from around the world. A sample size this large is bound to contain useful information about restaurant performance.

We began with data visualization in order to let the data paint its own picture and determine what areas are of interest. At first glance we notice that our data is heavily skewed, with more than 80% of the restaurants located in India. With such a heavy skew, analyzing the data as a whole would be a mistake, since results for the other countries would be overwhelmed by India's.

As such, we isolate the restaurants in India from the rest of the world and investigate further. Our key indicator of restaurant performance is its rating measure - both Rating Text and Aggregate Rating - the former allowing for a general overview and the latter for more precise regression analysis.

After identifying these trends among restaurants in India, we move on to analyze which variables actually affect restaurant performance. We identified a skewed distribution in the number of votes per restaurant during our exploratory analysis and take this into consideration when running our regression. We use votes as a weight parameter for our regression in order to obtain a better analysis. 
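As a toy illustration of why the weights matter (fabricated numbers, not the zomato data), heavily weighted observations pull the fitted line toward themselves:

```r
# three points: two high-weight points on the line y = x, one low-weight outlier
d = data.frame(x = c(1, 2, 3), y = c(1, 2, 10), w = c(100, 100, 1))
slope_unweighted = coef(lm(y ~ x, data = d))["x"]
slope_weighted   = coef(lm(y ~ x, data = d, weights = w))["x"]
# the weighted slope sits much closer to 1 (the high-weight trend)
c(slope_unweighted, slope_weighted)
```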

Since we are working with a fixed rating range of 0 to 5, we choose a transformation whose back-transformed estimates cannot fall outside this range, and use it as our transformed response variable: $$log_{10}{\frac{Aggregate Rating}{5-Aggregate Rating}}$$

After running diagnostic plot tests with our regression models, we identify that the following variables are correlated with restaurant rating:

1. log(Average Cost for Two (USD))
2. Locality - Khandari is much worse off, while Aminabad has better restaurants
3. City - Jaipur has better ratings and Mysore has worse ratings than New Delhi
4. Booking Option - restaurants with the option to book are slightly better off

It is important to note that our selected regression model does not directly return an estimated aggregate rating. Let $Y$ be the estimated value the regression model outputs; in order to undo the transformation on our response variable, we must compute: $$Estimated\ Aggregate\ Rating = \frac{5\cdot10^{Y}}{1+10^{Y}}$$
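For reference, this back-transformation follows from solving the transformation for the rating; writing $R$ for Aggregate Rating and $Y$ for the model output:

$$Y = \log_{10}\frac{R}{5-R} \;\Longrightarrow\; 10^{Y}(5-R) = R \;\Longrightarrow\; R = \frac{5\cdot 10^{Y}}{1+10^{Y}}$$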